Avoid using WindowsFS in ClusterRerouteIT #52488
Issue elastic#52000 looks like a case of cluster state updates being slower than expected, but these slowdowns seem to be relatively rare: most invocations of `testDelayWithALargeAmountOfShards` take well under a minute in CI, while the occasional failing run takes 6+ minutes instead. When the test fails like this, cluster state persistence seems generally slow: most updates are slower than expected, and even some small updates take over 2 seconds to complete.

The failures all have in common that they use `WindowsFS`, which emulates Windows' behaviour of refusing to delete files that are still open. It does so by tracking all files (really, inodes) and validating that deleted files are really closed first. There is a suggestion in the Lucene test framework that this is a little slow [1].

To see if we can attribute the slowdown to that common factor, this commit suppresses the use of `WindowsFS` for this test suite.

[1] https://github.com/apache/lucene-solr/blob/4a513fa99f638cb65e0cae59bfdf7af410c0327a/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L166
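For readers unfamiliar with the mechanism, here is a minimal sketch of the open-file tracking idea described above. It is not Lucene's `WindowsFS` implementation: the class `OpenHandleTracker` and its methods are hypothetical names chosen for illustration, and the file key is just a stand-in for an inode. It only shows why every open, close and delete has to pass through shared, synchronized bookkeeping, which is the kind of overhead suspected here.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Lucene code): counts open handles per file key
 * and refuses deletion while any handle for that key is still open.
 */
final class OpenHandleTracker {
    // Number of open handles per file key; guarded by the monitor of "this".
    private final Map<Object, Integer> openHandles = new HashMap<>();

    /** Record that a handle on the given file key has been opened. */
    public synchronized void onOpen(Object fileKey) {
        openHandles.merge(fileKey, 1, Integer::sum);
    }

    /** Record that a handle was closed; the entry is dropped when the count reaches zero. */
    public synchronized void onClose(Object fileKey) {
        openHandles.computeIfPresent(fileKey, (key, count) -> count > 1 ? count - 1 : null);
    }

    /** Throw if the file key still has open handles, mimicking Windows' refusal to delete. */
    public synchronized void checkDeletable(Object fileKey) throws IOException {
        if (openHandles.containsKey(fileKey)) {
            throw new IOException("access denied: file is still open: " + fileKey);
        }
    }
}
```

Because all three methods share one lock, any slow work done while holding it would serialize every file operation in the test. In the Lucene test framework, a mock file system of this kind is normally disabled per suite with the `@LuceneTestCase.SuppressFileSystems("WindowsFS")` annotation, which is presumably how the suppression in this change is applied.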
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
LGTM
LGTM.
Nice find and a good experiment. If successful, I would prefer that we dig a bit into the WindowsFS before we disable it for this test permanently (or alternatively increase the timeout when running that FS). Looks like there is at least a bit of IO done under a lock that could be improved (though in Lucene).
Same as elastic#52488 but for a different test suite Closes elastic#58019